The relative abundance of languages: Neutral and non-neutral dynamics

Credible estimates suggest that a large number of the nearly 7000 languages in the world could go extinct this century, a prospect with profound cultural, socioeconomic, and political ramifications. Despite its importance, we still have little predictive theory for language dynamics and richness. Critical to the language extinction problem, however, is an understanding of the dynamics of the number of speakers of languages, that is, the dynamics of language abundance distributions (LADs). Many regional LADs are very similar to the bell-shaped distributions of relative species abundance predicted by neutral theory in ecology. Using the tenets of neutral theory, here we show that LADs can be understood as an equilibrium or disequilibrium between stochastic rates of origination and extinction of languages. However, neutral theory does not fit some regional LADs, which can be explained if the number of speakers has grown systematically faster in some languages than others, due to cultural factors and other non-neutral processes. Only the LADs of Australia and the United States deviate from a bell-shaped pattern. These deviations are due to the documented higher, non-equilibrium extinction rates of low-abundance languages in these countries.


Introduction
Linguistic richness, defined as the number of languages, is not evenly distributed on Earth [1,2], with the majority of the language-rich countries situated in the tropics (e.g., [3]). Likewise, the number of speakers is not evenly distributed among languages; some languages have hundreds of millions of speakers while others have only a few [4]. Similar patterns of richness and abundance are found in ecology. One of the recent advances in ecology has been the development of the Neutral Theory of Biodiversity and Biogeography [5] (hereafter NTBB) to explain patterns of relative species abundance, as measured for example by the number of individuals, on local to global scales. One prediction of the NTBB is the pattern of relative species abundance to expect under the assumption of a dynamic equilibrium between the origin and extinction of species. The theory predicts that the steady-state distribution of species abundances will be bell-shaped on a logarithmic scale when species undergo fission into daughter […] reminiscent of the Theory of Island Biogeography [9] in which the unit was the species. In these cases, the neutrality is at the species, or political group, level, meaning they all have equal functional properties relevant to the dynamics of languages. In contrast, in NTBB, including our application to languages, the unit is the individual (or speaker) and the assumption is that all individuals are equal in their per capita rates of reproducing (transmitting the language to the offspring), dying and giving rise to a new language, the latter known as glossogenesis.
There are several versions of the NTBB depending on the speciation (or glossogenesis) mode adopted [10,11]; and see (S3 Appendix). For present purposes, we assume that each incipient language (within a homogeneous region) starts with the same fixed number of individuals. Clearly, the assumption of the same starting number of individuals is a first approximation, but in broad agreement with previous estimates of initial population sizes [12]. Under this assumption, Allen and Savage [13] derived the equilibrium relative abundance distribution of the NTBB. An equilibrium relative abundance distribution means that the number of languages and the number of speakers distributed among languages is the same over time, implying that the total number of individuals is constant. However, this equilibrium is a dynamic steady-state, and does not imply that the languages in a community are always the same: some may go extinct and new ones may arise. Nor does it assume that languages always have the same number of speakers: some languages may increase in number of speakers while others may decrease in numbers.
The two parameters of the Allen-Savage distribution are J_S, the incipient size of the population, and ν, the per capita glossogenesis (speciation) rate, both capturing important characteristics of language dynamics. In fact, estimating these parameters can help identify regions of the globe with different rates of language origination and the typical size of a human group. This observation is important because the bell shape of the LADs, when plotted on a logarithmic scale, makes the lognormal a natural candidate to fit these distributions. However, the parameters of the lognormal distribution, the mean and variance, do not have any process interpretation, whereas the parameters of the distribution suggested by Allen and Savage are readily interpreted mechanistically. Nevertheless, for comparison with other published accounts, we will also fit the data using the lognormal distribution.
Although the assumption of equality at the individual level has been controversial in ecology (e.g., [14]), the neutrality assumption for all language speakers is less likely to generate controversy, under the assumption that all individuals are broadly subject to the same social and environmental conditions. This is in line with the applications of NTBB in ecology, where it is assumed that all individuals belong to the same trophic level and the same biogeographic region (e.g., the Amazon basin). Despite being less controversial, the Allen-Savage distribution does not provide a good fit in all situations. This is likely due to non-neutral processes, in particular differential growth in the number of speakers among different languages. Indeed, although the human population has been growing globally in the last centuries, not all languages have been growing at the same rate; in fact, some have disappeared. What is remarkable is that non-neutral growth differences among some linguistic groups lead to clear departures from the LADs predicted by the Allen-Savage distribution, allowing us to distinguish which LADs are consistent with neutrality and which are not.

The model
Here we treat each country as a self-contained unit where the processes of death, birth and glossogenesis occur; thus, a country is the equivalent of a biogeographic region in the original NTBB. The purpose of choosing the country as the unit of analysis (and not, say, the continental or global scale) is to ensure that individuals are subject to similar conditions. Clearly, countries are not closed systems; migrations occur and often the same language is spoken in different countries. The advantage of using a country as the basic spatial unit is that we can consider, as a first approximation, that its linguistic populations are subject to similar environmental and social conditions. On the other hand, we assume that if migrations occur they do not have a strong impact on the overall language abundance distribution; if this is not the case, as in periods of social upheaval, the model we now describe will not apply.
Allen and Savage [13] derived the abundance distribution in the case of a biogeographic region under the assumptions of the NTBB and by supposing that an incipient language starts with a constant number of individuals. When applied to language dynamics, the assumptions of the Allen-Savage model for a given country are that (i) the average number of speakers of an incipient language is J_S, (ii) the total population of speakers of all languages, J, in the country of interest is very large relative to J_S, (iii) the total origination rate of languages equals the extinction rate, (iv) different populations have similar per capita rates of glossogenesis and (v) the total population, J, fluctuates stochastically around a mean value. Some of these assumptions are unrealistic for present human populations. For instance, assumption (iii) implies that the number of languages is constant over time, and we know that at present a large number of languages are becoming extinct. Equally, assumption (v) does not hold at present given the fast growth of most human populations. We will discuss their broad implications in due time.

The Allen-Savage distribution
The two parameters of the Allen-Savage model to be estimated from the data are the size of the population speaking a new language, J_S, or the fraction P_S = J_S/J, and the rate of origination of languages (glossogenesis), ν, which is usually combined with the total population size of the country, J, to form the parameter θ = 2Jν. The parameter θ is a dimensionless language diversity number, corresponding to the fundamental biodiversity number of NTBB, that reflects the fact that the richness of languages is a function of the per capita rate of glossogenesis, ν, and of the total size of the population, J. We use likelihood methods to estimate the parameters P_S (or J_S) and θ (or ν). However, the exact likelihood formula of Etienne and Alonso [15] is computationally demanding; therefore, we use an approximate likelihood formula [10,16] based on the following considerations. Consider the probability, p(N), of a language having N speakers in a population of J speakers of all languages. The probability p(N) can be estimated from the ratio of the expectation of the number of languages with N speakers, E(L_N|J), to the expectation of the language richness, E(L|J):

$$p(N) = \frac{E(L_N \mid J)}{E(L \mid J)}$$

Then the likelihood function, ℓ, is [10,16]

ℓ = […]

Under the assumption that an incipient language starts with a constant number of individuals, J_S, for a country with J speakers distributed among L languages, the probability, p(N), of finding languages with N speakers is [13]

p(N) = […]

where γ ≈ 0.57721 is the Euler-Mascheroni constant, and E_1(x) is the exponential integral function [17]. We refer to this distribution as the "Allen-Savage distribution".
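The relations θ = 2Jν and P_S = J_S/J stated above can be sketched as a small conversion helper. This is only an illustration: the function name and the numerical values are hypothetical and are not taken from the fitted results.

```python
import math


def glossogenesis_parameters(theta, p_s, total_speakers):
    """Convert fitted (theta, P_S) into (nu, J_S) for a country with J speakers.

    Uses only the definitions theta = 2 * J * nu and P_S = J_S / J.
    """
    if total_speakers <= 0:
        raise ValueError("total_speakers (J) must be positive")
    nu = theta / (2.0 * total_speakers)  # per capita glossogenesis rate
    j_s = p_s * total_speakers           # incipient language size
    return nu, j_s


# Hypothetical values, chosen only to illustrate the conversion.
nu, j_s = glossogenesis_parameters(theta=50.0, p_s=1e-4, total_speakers=1e7)

# Same theta and P_S, but twice the total population:
nu_2, j_s_2 = glossogenesis_parameters(theta=50.0, p_s=1e-4, total_speakers=2e7)
print(math.isclose(nu_2, nu / 2), math.isclose(j_s_2, 2 * j_s))
```

With θ and P_S held fixed, doubling J halves ν and doubles J_S, which is why the values of ν and J_S depend on the total population size used in the conversion.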

The data
We used data on languages and number of speakers per country from the Ethnologue [4]. The reason for choosing the country as the unit of analysis (and not, say, continental or global scales) is to ensure that individuals are subject to similar conditions. However, this leaves unanswered the question of the environmental heterogeneity within a country, and of how this heterogeneity affects the in-country linguistic diversity. Obviously, contingent historical events determine the richness of languages, but environmental factors are also likely to play a key role. According to Nettle [1], language richness is a function of the ecological risk, especially in non-industrial societies. Ecological risk is the degree to which human populations are exposed to the vagaries of their natural environment. The justification is that when ecological risk is high, human populations are more dependent on each other and, thus, local language differences do not diverge to the point of becoming mutually unintelligible. On the other hand, if ecological risk is low, populations are less dependent on each other and local language variations can more easily diverge, forming new languages. Therefore, the higher the ecological risk, the smaller the number of languages. In order to measure ecological risk, Nettle [1] used the mean growing season, defined as the number of months in which the monthly rainfall (in millimeters) is greater than twice the monthly temperature ([1], p. 82). Although we acknowledge that this is a very simple measure of ecological risk and of the environmental determinants of linguistic diversity (but see [18][19][20]), we will use it as a first approximation to guarantee relatively homogeneous regional units. This reduces the number of countries to 46 (i.e., those with standard deviation of the growing season smaller than two months) and to a total of 4099 languages.
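The growing-season measure and the homogeneity filter described above can be sketched as follows. Several details are assumptions made for illustration: temperature is taken in degrees Celsius, the standard deviation is computed across several sites within a country using the sample formula, and the function names and climate values are hypothetical.

```python
from statistics import stdev


def growing_season_months(rainfall_mm, temperature_c):
    """Count months whose rainfall (mm) exceeds twice the temperature.

    Both arguments are sequences of 12 monthly values.
    Temperature in degrees Celsius is an assumption of this sketch.
    """
    if len(rainfall_mm) != 12 or len(temperature_c) != 12:
        raise ValueError("expected 12 monthly values for each series")
    return sum(1 for rain, temp in zip(rainfall_mm, temperature_c)
               if rain > 2 * temp)


def is_homogeneous(site_growing_seasons, max_sd_months=2.0):
    """Keep a country if the spread of growing season across its sites is small.

    Assumption: the spread is the sample standard deviation across sites.
    """
    return stdev(site_growing_seasons) < max_sd_months


# Hypothetical site: six wet months followed by six dry months at 25 degrees.
rain = [100.0] * 6 + [10.0] * 6
temp = [25.0] * 12
print(growing_season_months(rain, temp))

# Hypothetical countries described by the growing season at a few sites.
print(is_homogeneous([6, 7, 8]), is_homogeneous([2, 10]))
```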

Results
Using the above likelihood method, we obtained values for parameters P_S (or J_S) and θ (or ν). Table 1 shows the results for countries with more than 50 languages; (S1 Appendix) shows the results for all countries studied. In Fig 1 we show examples of fitted distributions, and in (S1 Appendix) we show the fits for all countries. We distinguish two situations: those for which the Allen-Savage distribution is bell-shaped and gives a good fit, Fig 1A-1F (Solomon Islands, Cameroon and Papua New Guinea), and those for which the Allen-Savage distribution exhibits a plateau at intermediate language abundances and gives a poor fit, Fig 1G-1L (Colombia, Indonesia and Philippines). As previously mentioned, we also fitted the language abundance distributions with the lognormal distribution (blue line in Fig 1). Unlike the Allen-Savage distribution, the lognormal distribution never exhibits a plateau. To assess the fit of the Allen-Savage and lognormal distributions, we used the ratio of the Akaike weights [21]. Excluding Colombia and the Asian countries for which the Allen-Savage distribution gives a poor fit, in most cases there is not a clear best distribution (Table 2). When there is a best distribution, it is usually the Allen-Savage distribution (e.g., Cameroon). A notable exception is Papua New Guinea, where the lognormal provides a better description; nevertheless, visual inspection and the confidence interval plot, Fig 1F, do not reveal a clear advantage to the lognormal.

We urge caution when interpreting the values of J_S and ν (Table 1) because these values are estimated from countries where populations have been growing. Relating these estimated values of J_S to the typical size of an ethnic group (e.g., [22]) originating a language would give a wrong estimate. As we discuss in the next section, if all populations grow at the same rate, what remains constant is P_S = J_S/J and θ = 2νJ, and any attempt to relate the values of J_S and ν to real attributes of the populations has to consider the total size of the populations at a point in time when those populations were under the equilibrium assumptions of the Allen-Savage distribution; see also (S4 Appendix).
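The model comparison based on Akaike weights can be illustrated with a short sketch. It uses the standard small-sample corrected criterion, AICc = -2 ln L + 2k + 2k(k+1)/(n - k - 1), and the standard definition of Akaike weights; the AICc values below are hypothetical placeholders, not results from the fits.

```python
import math


def aicc(log_likelihood, n_params, n_obs):
    """Corrected Akaike Information Criterion for a fitted model."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc needs n_obs > n_params + 1")
    aic = -2.0 * log_likelihood + 2.0 * n_params
    return aic + 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)


def akaike_weights(aicc_values):
    """Akaike weights: relative support for each model in the candidate set."""
    best = min(aicc_values)
    rel = [math.exp(-0.5 * (value - best)) for value in aicc_values]
    total = sum(rel)
    return [r / total for r in rel]


# Hypothetical AICc values for the two candidate distributions.
aicc_as, aicc_logn = 100.0, 102.0
w_as, w_logn = akaike_weights([aicc_as, aicc_logn])
print(w_as / w_logn)  # equals exp((aicc_logn - aicc_as) / 2)
```

Because both distributions considered here have two parameters, the ratio of weights depends only on the difference between the two maximized log-likelihoods.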

Discussion
We introduced a model to describe the relative abundance of languages, as measured by their number of speakers, based on the neutral theory of biodiversity and biogeography [5]. We assumed that languages arise with a probability of origination ν and a specific number of speakers J_S; under these assumptions the relative abundance of languages is described by the Allen-Savage distribution. (We also explored the fitting of the language distributions under the assumption of variable J_S but with poor results; see (S2 Appendix).) One of the strengths of our approach is that the parameters of the Allen-Savage distribution are readily interpreted in terms of the probability of a new language arising, ν, and the size of its population at the origin, J_S. The estimation of these parameters using realistic (total) population sizes may help identify regions where large numbers of languages arose and the typical size of the ethnolinguistic group originating a new language, and how these parameters vary among different regions of the globe. However, the interpretation of the numerical values of ν and J_S warrants some consideration because human populations are not in the equilibrium conditions assumed by the neutral model, as we now discuss; see also (S4 Appendix).

When analyzing the dynamics of human populations, an unavoidable consideration is their (very fast) growth in recent centuries, in particular in the 20th century. This is in clear contradiction to one of the assumptions of the Allen-Savage distribution. Therefore, exploring the implications of growth to explain the failure of the Allen-Savage distribution to fit some distributions is in order. Moreover, we should not expect all populations to grow at the same rate, especially if populations are identified by an attribute such as language. In fact, among some linguistic groups the number of speakers has decreased, eventually to the point of extinction. This does not necessarily imply that, when the number of speakers decreases, the individuals representing those speakers are no longer present in the community. They may have been eliminated, as in the case of genocide, but in other cases speakers may have been forced to shift, or may have shifted voluntarily, from one language to another, as when, for instance, parents do not teach their children their mother tongues and instead adopt another language perceived as having more prestige or bringing more socio-economic benefits, such as access to education.

Table 1. List of countries with more than 50 languages, their number of languages, number of individuals, J, and the maximum likelihood estimates of θ = 2Jν, ν (the glossogenesis rate), P_S (the fraction of the number of individuals, relative to J, of an incipient language), and J_S = P_S × J. Countries with names in italics correspond to cases of poor fit, as revealed by the rank abundance plots. See (S1 Appendix) for the complete list of countries.

Table 2. Corrected Akaike Information Criterion values (AICc) for the Allen-Savage (AS) and lognormal (logn) distributions, their weights (w) (Burnham and Anderson 2010) and their ratio, w_AS/w_logn, for the countries with more than 50 languages. See (S1 Appendix) for the complete list of countries.

The development of a plateau for some Allen-Savage curves is an important result of this work, and we argue that this is a signature of non-neutral dynamics. Among LADs with a plateau one can distinguish two distinct potential mechanisms of non-neutral dynamics. One mechanism is exemplified by Colombia (Fig 1G), the other by Indonesia and Philippines (Fig 1I and 1K). In Colombia's case there is one language (Spanish) that has a much larger number of speakers than all the rest; see also the LADs of Myanmar and Vietnam in the (S1 Appendix).
If we remove the largest language, then the Allen-Savage distribution gives a good fit to the remaining distribution. This example also serves to show that the maximum likelihood estimators of the Allen-Savage distribution are affected by extreme values, while those of the lognormal distribution are not. In the situation illustrated by Indonesia and Philippines (Fig 1I-1K) there is not a language with a much larger (isolated) number of speakers, like Spanish in Colombia's LAD, but nevertheless the Allen-Savage distribution still exhibits an extended plateau. The departure from a bell-shaped distribution is particularly obvious for the Philippines' LAD, where, in fact, even the lognormal gives a poor fit.

We interpret the results of the previous paragraph as revealing the effects of recent differential population growth rates. In the following we present an explanation for the plateau exhibited by some fitted distributions. First, consider the situation in which, in a given geographical region, the number of speakers of each language grows at the same rate as all other languages. For simplicity, assume that the period of observation of the growth of languages is short, so that no new languages emerge. Also assume that at the beginning of the observation period, t = 0, the size of the linguistic groups is at the equilibrium given by Eq 1. Now let these populations collectively start to grow. If all populations have the same growth rate, then the number of speakers of a language i at a time t > 0, N_{it}, relates to the number at time t = 0, N_{i0}, as N_{it} = C_t N_{i0}, where C_t is the same for all populations. In this case the total population size at time t is J_t = C_t J_0.
In terms of the histograms of the log-transformed values of the number of speakers, such as depicted in Fig 1, the growth of the populations at the same rate corresponds to a shift by log_2(C_t) of the distributions to the right along the x-axis because log_2(N_{it}) = log_2(C_t) + log_2(N_{i0}). Since the shift, log_2(C_t), is the same for all populations, only the mean of the distribution changes but not its variance. Notice that the parameters θ and P_S of Eq 1 are the same, because a change of variable from N to P = N/J does not change the analytic expression of Eq 1, but the distributions for t ≠ 0 will have a different rate of glossogenesis, ν, and incipient population size, J_S; see (S2 Appendix). In other words, the parameters ν and J_S estimated from a group of populations that have been growing are different from those of the same populations when their sizes are in equilibrium, but θ and P_S are the same. Now consider a different situation, one in which languages have different growth rates. Using the same notation as before, N_{it} = C_{it} N_{i0}, but C_{it} is no longer the same for all languages, hence the index i. The corresponding log-transformed values are log_2(N_{it}) = log_2(C_{it}) + log_2(N_{i0}). When log_2(C_{it}) is not the same for all languages, the shift they experience along the x-axis when t increases is not the same, and the resulting LAD does not have the same shape as the original distribution. What remains to be shown is the implication of differential growth for the fit provided by the Allen-Savage distribution. For simplicity, assume that C_{it} > 1 (all populations grow) and assume that larger languages have an advantage over smaller ones, that is, larger languages have larger C_{it}.
For the sake of example, assume that C_{it} = exp(r_i t) and that r_i is proportional to the logarithm of the number of speakers at t = 0, r_i = D log(N_{i0}), where D is a positive constant (this does not have to be the case; it is just to obtain a simpler mathematical expression). Then

$$\log N_{it} = \log N_{i0} + r_i t = (1 + Dt)\,\log N_{i0},$$

leading to distributions shifting to the right and, simultaneously, becoming wider as t increases. The important point is that if we attempt to fit these distributions with the Allen-Savage distribution, we observe that, as time increases, the fitting curves develop a plateau. To illustrate this, we use the LAD of Cameroon, a distribution that is well fitted by the Allen-Savage distribution. If we allow its languages to grow at different rates, the fitted Allen-Savage distributions develop a plateau that becomes more pronounced as t increases, as illustrated in Fig 2. Note that the LAD of Cameroon at t = 0 is the real one; thus the plateau of the fitted distribution depicted in Fig 2 could be a prediction of our model if the Cameroon languages were to start growing at different rates.
In summary, if a LAD is initially described by the Allen-Savage distribution and the populations start growing at approximately the same rate (neutral growth), then the resulting LADs are well fitted by the Allen-Savage distribution, with the same θ and P_S values. On the other hand, if populations have differential (non-neutral) growth rates, then the fitted Allen-Savage distributions develop a plateau at intermediate language abundances that widens over time as populations grow. Therefore, a plausible explanation for the plateau in Fig 1G, 1I and 1K (see also S1 Appendix) is the differential, non-neutral, growth of languages.
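The contrast between neutral and non-neutral growth can be checked numerically on the log_2 scale used for the histograms. The sketch below is a minimal illustration, assuming the example growth law r_i = D log(N_{i0}) with a natural logarithm; the language sizes, C_t, D and t are hypothetical, and the sketch does not fit the Allen-Savage distribution itself.

```python
import math
from statistics import mean, pvariance


def log2_sizes(sizes):
    """Log2-transformed numbers of speakers (the x-axis of the histograms)."""
    return [math.log2(n) for n in sizes]


def grow_uniform(sizes, c_t):
    """Neutral growth: every language is multiplied by the same factor C_t."""
    return [c_t * n for n in sizes]


def grow_differential(sizes, d, t):
    """Non-neutral growth with r_i = D * ln(N_i0), so N_it = N_i0 ** (1 + D*t)."""
    return [n * math.exp(d * math.log(n) * t) for n in sizes]


# Hypothetical initial numbers of speakers for four languages.
n0 = [100.0, 1_000.0, 10_000.0, 100_000.0]
x0 = log2_sizes(n0)

# Same rate for all: the mean shifts by log2(C_t), the variance is unchanged.
xu = log2_sizes(grow_uniform(n0, c_t=4.0))
print(mean(xu) - mean(x0), pvariance(xu) / pvariance(x0))

# Larger languages grow faster: the distribution shifts right and widens.
xd = log2_sizes(grow_differential(n0, d=0.1, t=5.0))
print(mean(xd) / mean(x0), pvariance(xd) / pvariance(x0))
```

Under the assumed growth law the variance of the log-transformed sizes grows by a factor (1 + Dt)^2, which is the widening that accompanies the plateau in the fitted curves.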
Two observations from our results are worth reporting. The first is that most LADs of African countries are well fitted by the Allen-Savage distribution; see (S4 Appendix). According to our previous discussion this could be explained by neutral (or near-neutral) growth among African linguistic groups. It is outside the scope of this work to identify the causes for such non-differential (neutral) growth, but these are likely to lie in the degree of centralization of political power or the enforcement of a few selected languages in education, usually those languages having higher prestige or spoken by larger ethnic groups. The second observation is that only Australia and the United States do not have bell-shaped LADs; their LADs are, instead, truncated bell-shaped distributions, being almost monotonically decreasing curves (Fig 3). However, these LADs are not evidence against the generality of the bell-shaped pattern that arises under steady-state conditions of language origination and extinction. Indeed, patterns such as those in the United States and Australia can arise from former LADs that were once bell-shaped but have subsequently been modified by processes causing the number of speakers of the majority of languages to decline. Possible processes include forced or voluntary language shift to higher-status languages, population declines due to European-introduced diseases, and genocide [23]. These processes shift the mode of the LAD towards lower abundances, resulting in the observed truncated LADs. What distinguishes the United States and Australia is that a large number of languages with a small number of speakers still remains, although many of these low-abundance languages are on the verge of extinction; thus the observed LADs are likely to represent a short transient period.
Finally, some considerations on the use of the lognormal distribution are in order. Previous work on language abundance distributions [24][25][26][27][28][29][30] emphasized the lognormal. Although the lognormal provides reasonably good fits, there is no demographic interpretation of the parameters, so that fitting a lognormal does not lead to further hypotheses to test. In contrast, all parameters of neutral theory applied to language abundance dynamics have demographic interpretations that generate testable hypotheses.

Fig 3. Language abundance distributions of Australia and the United States. Among all the countries studied, these were the only distributions that did not conform to the bell-shaped pattern. These skewed distributions reflect the decreasing sizes and higher extinction rates of low-abundance languages in these countries. The blue curves are the best-fit lognormal distributions.

Conclusions
We developed a new theory of the dynamics of languages that includes both origination and extinction under either equilibrium or non-equilibrium stochastic processes. We show that some language abundance distributions exhibit near-neutral dynamics, whereas others exhibit non-neutral dynamics. There are sufficient parallels between species and linguistic groups to suggest that a theoretical perspective similar to that developed by the NTBB in ecology might be useful in understanding the dynamics of language abundances that shape language diversity.
An important aspect of our approach is that we considered the relative number of speakers as a major determinant of linguistic diversity. We argue that any attempt to describe the dynamics of a system, or to identify causal relationships among its patterns and processes, that does not consider the relative abundance of its constituents is likely to miss an important determinant of its behavior; see also [5].
We anticipate that further development of a theory for language diversity will generate a wealth of testable hypotheses on language diversity and the underlying environmental and societal processes driving language dynamics, and it will bring changes in the respect for and protection of minorities' languages and cultures.